Project: Wrangle and Analyze Data

Data Gathering

In the cells below, gather the data needed for this project and load it into the notebook.

  1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)
  1. Use the Requests library to download the tweet image predictions (image_predictions.tsv)
  1. Use the Tweepy library to query additional data via the Twitter API; due to API access issues, the provided tweet_json.txt file is used instead
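The fallback in step 3 can be sketched offline: tweet_json.txt holds one JSON object per line, which is parsed line by line into a dataframe. This is a minimal sketch using a hypothetical two-line sample in place of the real file; the field names (`id`, `retweet_count`, `favorite_count`) are the ones used later in this report.

```python
import io
import json

import pandas as pd

# Hypothetical two-line stand-in for tweet_json.txt (one JSON object per line).
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)

# Parse each line and keep only the columns needed for the analysis.
rows = [json.loads(line) for line in sample]
twitter_api_data = pd.DataFrame(rows, columns=["id", "retweet_count", "favorite_count"])
```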

Assessing Data

In this section, we detect and document quality issues and tidiness issues.

  1. Visual assessment
  2. Programmatic assessment

Assess twitter_archive data

Assess the dataframe visually and programmatically to identify any quality or tidiness issues.

Assess twitter_api_data dataframe

Assess image_prediction dataframe

Quality issues

Twitter_Archive_Data
  1. Missing values in columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls

  2. 23 rating denominators are not equal to 10

  3. Many rating numerators are less than 10, and there are many outrageously large numerators

  4. Some dogs have no names; instead, non-name words such as 'a', 'the', 'my' and 'old' appear in the name column

  5. 'Timestamp' has the wrong (string) datatype

  6. Source and text info to be dropped

Twitter_Api_Data
  1. The 'id' column in twitter_api_data contains the same data as 'tweet_id' in twitter_archive_data but under a different name; 'id' should be renamed
Twitter_archive_data
  1. The year should be extracted from the timestamp column and saved in a new column 'year'; the timestamp column should then be dropped.

Tidiness issues

  1. The dog stages are spread across four individual columns (doggo, floofer, pupper, puppo) instead of being in a single column

  2. The twitter_archive_data and twitter_api_data must be merged

Cleaning Data

In this section, clean all of the issues you documented while assessing.

Note: Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of tidy data. The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

Quality issues

Issue #1:

Missing values in columns: in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls

Define

Instead of trying to fill the NAs with mean values, drop all of the above columns from the dataframe

Code
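A minimal sketch of this step, assuming the archive has been copied to `twitter_archive_clean`; a tiny hypothetical dataframe stands in for the real data.

```python
import pandas as pd

# Hypothetical miniature of the archive copy; the real data has many more rows.
twitter_archive_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "in_reply_to_status_id": [None, None],
    "in_reply_to_user_id": [None, None],
    "retweeted_status_id": [None, None],
    "retweeted_status_user_id": [None, None],
    "retweeted_status_timestamp": [None, None],
    "expanded_urls": ["https://example.com/1", None],
})

# Drop the sparsely populated columns rather than imputing them.
cols_to_drop = [
    "in_reply_to_status_id", "in_reply_to_user_id",
    "retweeted_status_id", "retweeted_status_user_id",
    "retweeted_status_timestamp", "expanded_urls",
]
twitter_archive_clean = twitter_archive_clean.drop(columns=cols_to_drop)
```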

Test

Issue #2:

23 rating denominators are not equal to 10

Define: Remove all rows whose rating denominator is not equal to 10

Code
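A sketch of the filter, again on a hypothetical miniature of the archive:

```python
import pandas as pd

# Hypothetical sample: one row has a denominator other than 10.
twitter_archive_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "rating_numerator": [12, 84, 13],
    "rating_denominator": [10, 70, 10],
})

# Keep only rows rated out of 10.
twitter_archive_clean = twitter_archive_clean[
    twitter_archive_clean["rating_denominator"] == 10
]
```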

Test

Issue #3: So many rating numerators are less than 10 and there are many outrageous numerators

Define

Drop rows whose rating_numerator is less than 11 or greater than 20

Code

Test

Issue #4:

Wrong (non-name) dog names such as 'not', 'a' and 'by' instead of 'None' where no dog name was given

Define

Change the wrong dog names to 'None'

Code
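A sketch of the replacement. The list of bad names here is illustrative (taken from the examples above); in practice one would collect every lowercase "name", since real dog names in the archive are capitalized.

```python
import pandas as pd

# Hypothetical sample of the name column.
twitter_archive_clean = pd.DataFrame({"name": ["a", "Charlie", "the"]})

# Illustrative list of non-name words observed during assessment.
bad_names = ["a", "the", "my", "old", "not", "by"]
twitter_archive_clean["name"] = twitter_archive_clean["name"].replace(bad_names, "None")
```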

Test

Issue #5: Timestamp has the wrong datatype

Define: Convert timestamp datatype from string to datetime datatype

Code
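A sketch of the conversion with `pd.to_datetime`, on hypothetical timestamp strings:

```python
import pandas as pd

# Hypothetical sample of string timestamps.
twitter_archive_clean = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56", "2016-06-18 12:00:00"],
})

# Convert the string column to a proper datetime dtype.
twitter_archive_clean["timestamp"] = pd.to_datetime(twitter_archive_clean["timestamp"])
```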

Test

Issue #6: Source and text columns to be dropped

Define: Drop 'source' and 'text' columns

Code

Test

Issue #7: The 'id' column in twitter_api_data contains the same data as 'tweet_id' in twitter_archive_data but under a different name; 'id' should be renamed

Define: Rename the 'id' column in twitter_api_data to 'tweet_id'

Code
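A sketch of the rename, on a hypothetical miniature of twitter_api_data:

```python
import pandas as pd

# Hypothetical sample of the API data with the mismatched column name.
twitter_api_data = pd.DataFrame({"id": [1, 2], "retweet_count": [5, 3]})

# Align the key column name with the archive's 'tweet_id'.
twitter_api_data = twitter_api_data.rename(columns={"id": "tweet_id"})
```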

Test

Issue #8: The year should be extracted from the timestamp column and saved in a new column 'year'; the timestamp column should then be dropped.

Define: Extract year from 'timestamp'. Save to new column 'year'. Drop 'timestamp'.

Code
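A sketch of the extraction, assuming 'timestamp' has already been converted to datetime (Issue #5):

```python
import pandas as pd

# Hypothetical sample with a datetime timestamp column.
twitter_archive_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": pd.to_datetime(["2016-06-18 12:00:00", "2017-08-01 16:23:56"]),
})

# Pull the year out, then drop the original column.
twitter_archive_clean["year"] = twitter_archive_clean["timestamp"].dt.year
twitter_archive_clean = twitter_archive_clean.drop(columns=["timestamp"])
```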

Test

Tidiness issues

Issue #1: The different names of dog stages in individual columns instead of being in a single column

Define: Use melt to create a new column with doggo, floofer, pupper and puppo as values

Code

Test

Issue #2: The twitter_archive and twitter_api must be merged

Define: Merge twitter_archive and twitter_api

Code
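A sketch of the merge on the shared `tweet_id` key, using tiny hypothetical versions of the two dataframes:

```python
import pandas as pd

# Hypothetical miniatures of the two cleaned dataframes.
twitter_archive_clean = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [12, 13]})
twitter_api_data = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [5, 3]})

# Inner merge keeps only tweets present in both sources.
twitter_archive_master = twitter_archive_clean.merge(
    twitter_api_data, on="tweet_id", how="inner"
)
```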

Test

Define: Drop rows with 3 false predictions (i.e., none of p1, p2 or p3 predicts a dog)

Code
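A sketch of the filter, assuming the boolean `p1_dog`/`p2_dog`/`p3_dog` columns from the image predictions file:

```python
import pandas as pd

# Hypothetical miniature: tweet 2 has all three predictions false.
image_predictions = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1_dog": [True, False],
    "p2_dog": [False, False],
    "p3_dog": [True, False],
})

# Keep rows where at least one prediction says the image is a dog.
keep = image_predictions[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
image_predictions = image_predictions[keep]
```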

Test

Define: Convert all strings in p1, p2 and p3 to lowercase and replace spaces and hyphens with underscores

Code
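A sketch of the string normalization, on hypothetical breed labels:

```python
import pandas as pd

# Hypothetical sample of prediction labels with mixed case, spaces and hyphens.
image_predictions = pd.DataFrame({
    "p1": ["Golden Retriever"],
    "p2": ["Labrador-Retriever"],
    "p3": ["pug"],
})

# Lowercase, then replace spaces and hyphens with underscores.
for col in ["p1", "p2", "p3"]:
    image_predictions[col] = (
        image_predictions[col]
        .str.lower()
        .str.replace(" ", "_", regex=False)
        .str.replace("-", "_", regex=False)
    )
```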

Test

Define: Drop all columns that will not be used in analysis

Code

Test

Storing Data

Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
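A sketch of the save-and-reload round trip, on a hypothetical miniature of the master dataframe (written to a temp directory here rather than the working directory):

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of the cleaned master dataframe.
twitter_archive_master = pd.DataFrame({"tweet_id": [1, 2], "retweet_count": [5, 3]})

# Write without the index, then reload to confirm the round trip.
path = os.path.join(tempfile.gettempdir(), "twitter_archive_master.csv")
twitter_archive_master.to_csv(path, index=False)
reloaded = pd.read_csv(path)
```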

Analyzing and Visualizing Data

Research questions:

  1. What were the most popular breeds predicted by the image prediction model?

  2. What was the most popular dog stage in the twitter_archive dataframe?

  3. Were the ratings affected by the retweet counts and favorite counts?

  4. Which year has the most tweets?

  5. How has retweet count changed over the years?

Read the master CSV files into pandas dataframes

We will use 'twitter_archive_master' for analyses 1 to 4 (as seen below)

From the histogram above, we see that most of the data (tweets) we have is from the year 2016

Insight 1: The scatter plot shows that 2016 is the year with the highest retweet_count, while 2015 was quite low

Insight 2: 2015 had a low favorite_count, while 2016 and 2017 saw higher favorite_counts

Insight 3: Although 12 is the most frequently occurring rating, 13 is actually the rating with the highest retweet and favorite counts.

We use the image_predictions_master dataframe to do the analysis below

Some simple analyses and visualizations have been done